The main aim is to predict a breast cancer patient's chance of survival.
- Clean the data
- Augment the data
- Create some plots
- Perform statistical analysis
- Create the prediction tool
We are working with a breast cancer dataset that we obtained from the Kaggle website.
A sample of the dataset is shown below:
| patient_id | gender | education | treatment_data | id_healthcenter | id_treatment_region | hereditary_history | birth_date | age | weight |
|---|---|---|---|---|---|---|---|---|---|
| 111036008041 | 0 | 4 | 2019 | 1.11e+09 | 1.11e+09 | 1 | 1989 | 30 | 69 |
| 111035996130 | 0 | 6 | 2019 | 1.11e+09 | 1.11e+09 | 0 | 1989 | 30 | 71 |
| 111035971333 | 0 | 5 | 2019 | 1.11e+09 | 1.11e+09 | 0 | 1989 | 30 | 74 |
| 111036018485 | 0 | 5 | 2019 | 1.11e+09 | 1.11e+09 | 1 | 1989 | 30 | 75 |
| 111035985474 | 0 | 1 | 2019 | 1.11e+09 | 1.11e+09 | 0 | 2009 | 10 | 70 |
| 111035903616 | 0 | 3 | 2019 | 1.11e+09 | 1.11e+09 | 1 | 1989 | 30 | 79 |

Data cleaning, before and after:

| BEFORE | AFTER |
|---|---|
| COLUMNS HAVE INCONSISTENT TYPES | EACH COLUMN HAS THE CORRECT TYPE |
| 0, 1, 2 VALUES | BOOLEAN VARIABLES |
| NAMES WITH `\r\n` | CLEAN NAMES |
| BIRTH DATE WITH 3 CHARACTERS | BIRTH DATE WITH 4 CHARACTERS |
| BLOOD TYPE 44 | CORRECT BLOOD TYPES ONLY |
| WEIRD WEIGHT/AGE CORRELATIONS | PEOPLE UNDER 20 YEARS OLD AND UNDER 35 KG REMOVED |
| WOMEN AND MEN | ONLY WOMEN |
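
Some of the cleaning rules in the table can be sketched in plain Python. This is only an illustration, not the code used in the project: the `blood_type` column name, the coding of `gender` (0 assumed to mean women), the mapping of 0/1 values to booleans, and the reading of the age/weight rule as "drop a row if either limit is violated" are all assumptions, and the sample rows are made up.

```python
# Illustrative cleaning sketch. "blood_type" and the meaning of each coded
# value are assumptions, not taken from the real dataset.
VALID_BLOOD_TYPES = {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"}


def clean_name(name):
    """Remove carriage returns and newlines from a column name."""
    return name.replace("\r", "").replace("\n", "").strip()


def to_bool(value):
    """Map 1 -> True and 0 -> False; anything else (e.g. 2) becomes None."""
    return {0: False, 1: True}.get(value)


def clean_rows(rows, female_code=0, min_age=20, min_weight=35):
    """Return only the rows that pass every cleaning rule."""
    cleaned = []
    for raw in rows:
        row = {clean_name(key): value for key, value in raw.items()}
        if row["gender"] != female_code:  # keep only women (coding assumed)
            continue
        # Assumed reading of the rule: drop a row if either limit is violated.
        if row["age"] < min_age or row["weight"] < min_weight:
            continue
        if row["blood_type"] not in VALID_BLOOD_TYPES:  # e.g. the value "44"
            continue
        row["hereditary_history"] = to_bool(row["hereditary_history"])
        cleaned.append(row)
    return cleaned


# Made-up rows for illustration only.
sample = [
    {"gender": 0, "age": 30, "weight": 69, "blood_type\r\n": "A+", "hereditary_history": 1},
    {"gender": 0, "age": 10, "weight": 70, "blood_type\r\n": "O-", "hereditary_history": 0},
    {"gender": 0, "age": 30, "weight": 74, "blood_type\r\n": "44", "hereditary_history": 0},
    {"gender": 1, "age": 45, "weight": 80, "blood_type\r\n": "B+", "hereditary_history": 1},
]
print(clean_rows(sample))
```

With these made-up rows only the first one survives: the second fails the age limit, the third has an invalid blood type, and the fourth is filtered out by the assumed gender code.
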
We created several plots to better understand the data and ran some statistical analyses, including multiple correspondence analysis (MCA). The plots are shown in the "Results" section.
- Variables that affect health (medication, harmful habits) have a strong influence on breast cancer patients.
- Early menstrual periods before age 12 and starting menopause after age 55 expose women to hormones longer, raising their risk of getting breast cancer.
- In most cases, the number of deaths is higher among patients taking medication (this does not make sense).
- Not drinking alcohol and not smoking improves recovery.
- Among patients who drink alcohol and smoke, the number of deaths is lower (this does not make sense either).
- These are absolute values; we should probably calculate relative values as well.

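
The note about absolute versus relative values can be illustrated with a small sketch. It computes a death rate per group instead of a raw death count. The field names `smoker` and `died` are hypothetical, and the records are toy numbers invented for the example; they say nothing about the actual dataset.

```python
from collections import defaultdict


def death_rate_by_group(records, group_key, outcome_key="died"):
    """Return {group: deaths / group size} for each value of group_key."""
    totals = defaultdict(int)
    deaths = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec[outcome_key]:
            deaths[group] += 1
    return {group: deaths[group] / totals[group] for group in totals}


# Toy data (invented): 4 smokers with 2 deaths, 30 non-smokers with 3 deaths.
records = (
    [{"smoker": True, "died": True}] * 2
    + [{"smoker": True, "died": False}] * 2
    + [{"smoker": False, "died": True}] * 3
    + [{"smoker": False, "died": False}] * 27
)

rates = death_rate_by_group(records, "smoker")
print(rates)
```

In this toy example the non-smokers have more deaths in absolute terms (3 versus 2) but a much lower death rate (3 of 30 versus 2 of 4). A group-size effect like this is one possible reason why absolute counts can look counterintuitive.
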
We have reached the following conclusions: